There are various reasons why voter prediction is such a hard problem. Even though polls on voter preference are taken before elections, it is still difficult to predict voter behavior when election day comes. Perhaps the most obvious reason is that people may not answer polls honestly: if the candidate they plan to vote for is a controversial one, they may fear judgement, and so support for that candidate can be underestimated when making predictions based on the polls. Another possibility is that people who answered the poll simply don't show up to vote. With a non-standard presidential candidate like Trump, opinions didn't necessarily fall into the usual Republican vs. Democrat binary, so it is hard to predict voter behavior based on past elections. Finally, it is difficult to anticipate what political events or world occurrences (for example, a financial crash or natural disaster) may happen between the polls and the election and change people's intentions.
Many election predictions feature certain polls while excluding others that are seen as biased. Nate Silver, however, included many different polls in his prediction, because information can still be extracted from biased data (such as the change in voter intentions over time within a single polling source). He also included data from sources other than that election cycle's polls, including historical polls and previous election results, to help calibrate how the current polls would translate into actual voter behavior. In addition, he created a prediction model loosely based on Bayes' Theorem that updated daily, incorporating each day's support numbers to calculate a new probability for each candidate. Compared to typical prediction methods that simply average the polls, Nate Silver used more sophisticated statistical techniques: regression models and even Monte Carlo simulations of the Electoral College.
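The Monte Carlo piece of such a forecast can be illustrated with a short sketch. Everything below is invented for illustration — the state electoral-vote counts and win probabilities are hypothetical stand-ins, not Silver's actual model inputs:

```r
# Hypothetical per-state win probabilities for candidate A; a real forecast
# would estimate these from poll aggregates and historical adjustments.
set.seed(42)
states.sim <- data.frame(
  ev   = c(55, 38, 29, 29, 20, 18),             # hypothetical electoral votes
  pwin = c(0.95, 0.20, 0.50, 0.70, 0.55, 0.45)  # hypothetical P(A wins state)
)
n.sims <- 10000
# Each simulation flips a weighted coin per state and totals A's electoral votes
sim.ev <- replicate(n.sims,
                    sum(states.sim$ev * rbinom(nrow(states.sim), 1, states.sim$pwin)))
mean(sim.ev >= 95)  # estimated probability A wins a majority of these 189 votes
```

Repeating thousands of draws turns per-state probabilities into a distribution over electoral-vote totals, which is how a single headline win probability can coexist with substantial uncertainty.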
A lot went wrong in 2016 to cause the election prediction trouble. Many voters did not make up their minds until the last minute, which created a great deal of uncertainty in the polls, and these late deciders turned out to be more likely to vote for Trump than for Clinton. Although Clinton led most polls, it was only by a small margin, and predictions did not adequately account for the uncertainty or for how voter behavior would translate into the Electoral College. Clinton still won the popular vote, but not the overall election. Future predictions should be communicated better to the media, as many reporters and other news officials interpreted a 70% chance of Clinton winning as being the same as a sure win. Whether this was due to a lack of understandable communication from the statisticians, or excessive assumption by the media, is up for debate. Regardless, there needs to be clarity about how uncertain a prediction is in order for the media to know how much weight to place upon it.
#this code is given within the project
election.raw = read.csv("data/election/election.csv") %>% as.tbl
census_meta = read.csv("data/census/metadata.csv", sep = ";") %>% as.tbl
census = read.csv("data/census/census.csv") %>% as.tbl
census$CensusTract = as.factor(census$CensusTract)
head(election.raw) #examine the first few rows
## # A tibble: 6 x 5
## county fips candidate state votes
## <fct> <fct> <fct> <fct> <int>
## 1 <NA> US Donald Trump US 62984825
## 2 <NA> US Hillary Clinton US 65853516
## 3 <NA> US Gary Johnson US 4489221
## 4 <NA> US Jill Stein US 1429596
## 5 <NA> US Evan McMullin US 510002
## 6 <NA> US Darrell Castle US 186545
Each column in the dataset has a clear meaning except for the "fips" column. FIPS stands for Federal Information Processing Standards; the value denotes the area (US, state, or county) that each row of data represents, and it can be used to break the data apart into federal-, state-, and county-level data.
#creating the federal dataframe
election_federal <- filter(election.raw, is.na(county) & fips == "US")
#creating the state level dataframe
election_state <- filter(election.raw, is.na(county) & fips != "US" & as.character(fips) == as.character(state))
#comparing fips and state as characters gives an observation-wise equality check
#creating the county level dataframe (election)
election <- filter(election.raw, fips != "US" & as.character(fips) != as.character(state))
#check to make sure no overlapping and every observation is present
dim(election_federal)[1] + dim(election_state)[1] + dim(election)[1] == dim(election.raw)[1]
## [1] TRUE
The three subsets' row counts sum to the total number of observations, so there is no overlap and no missing data; the split has been carried out correctly.
#getting the number of candidates
dim(table(election$candidate))
## [1] 32
There are 31 named candidates along with a “none of these candidates” option that combines all other minor candidates, making 32 candidate options overall.
ggplot(data = election) +
geom_bar(
mapping = aes(x = candidate, y = votes/1000000), stat = "identity"
) +
ggtitle("Bar Chart of Election Votes by Candidate") +
xlab("Candidate") + ylab("Votes (in millions)") +
coord_flip()
#votes are plotted in millions to clean up the graph
county_winner <- election %>% #start with election
group_by(fips) %>% #group by fips
mutate(total = sum(votes), pct = votes/total) %>% #compute total votes and pct
top_n(1) #choose the row with the highest pct
## Selecting by pct
head(county_winner)
## # A tibble: 6 x 7
## # Groups: fips [6]
## county fips candidate state votes total pct
## <fct> <fct> <fct> <fct> <int> <int> <dbl>
## 1 Los Angeles County 6037 Hillary Clinton CA 2464364 135691978 0.0182
## 2 Cook County 17031 Hillary Clinton IL 1611946 135691978 0.0119
## 3 Maricopa County 4013 Donald Trump AZ 747361 135691978 0.00551
## 4 Harris County 48201 Hillary Clinton TX 707914 135691978 0.00522
## 5 San Diego County 6073 Hillary Clinton CA 735476 135691978 0.00542
## 6 Orange County 6059 Hillary Clinton CA 609961 135691978 0.00450
#same as above
state_winner <- election_state %>%
group_by(fips) %>%
mutate(total = sum(votes), pct = votes/total) %>%
top_n(1)
## Selecting by pct
head(state_winner)
## # A tibble: 6 x 7
## # Groups: fips [6]
## county fips candidate state votes total pct
## <fct> <fct> <fct> <fct> <int> <int> <dbl>
## 1 <NA> CA Hillary Clinton CA 8753788 135691978 0.0645
## 2 <NA> FL Donald Trump FL 4617886 135691978 0.0340
## 3 <NA> TX Donald Trump TX 4685047 135691978 0.0345
## 4 <NA> NY Hillary Clinton NY 4556124 135691978 0.0336
## 5 <NA> PA Donald Trump PA 2970733 135691978 0.0219
## 6 <NA> IL Hillary Clinton IL 3090729 135691978 0.0228
states = map_data("state")
ggplot(data = states) +
geom_polygon(aes(x = long, y = lat, fill = region, group = group), color = "white") + coord_fixed(1.3) +
guides(fill=FALSE) # color legend is unnecessary and takes too long
This is the state data map given to us.
county = map_data("county")
ggplot(data = county) + #adjust for county
geom_polygon(aes(x = long, y = lat, fill = subregion, group = group), color = "white") + coord_fixed(1.3) + #also subregion has changed for county level marking
guides(fill=FALSE) # color legend is unnecessary and takes too long
This is the map colored and split by county.
fips = state.abb[match(states$region, tolower(state.name))]
states <- states %>%
mutate(fips=fips) #preprocessing to join the data
combined_states <- left_join(states, state_winner, by="fips")
## Warning: Column `fips` joining character vector and factor, coercing into
## character vector
ggplot(data = combined_states) +
geom_polygon(aes(x = long, y = lat, fill = candidate, group = group),color = "white") +
coord_fixed(1.3) + #fill with candidate from combined states data that has the winner
guides(fill= FALSE)
county.1 <- maps::county.fips %>%
separate(polyname, c("region","subregion"), sep=",")
county.2 <- county.1 %>%
separate(subregion, c("subregion","extra"), sep=":") #separating the subregions so that the join will work properly
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3069
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].
county_fips <- county.2[-4] #remove excess data
# change fips column to factor like the original dataset
county_fips <- county_fips %>% mutate(fips=as.factor(fips))
combined_counties1 <- left_join(county, county_fips, by= c("subregion","region"))
combined_counties2 <- left_join(combined_counties1, county_winner, by="fips")
## Warning: Column `fips` joining factors with different levels, coercing to
## character vector
#plotting the winning candidate for each county
# note county "oglala" (fips = 46102) is greyed out because it is not in maps::county.fips (and thus county_fips) or counties data
ggplot(data = combined_counties2) +
geom_polygon(aes(x = long, y = lat, fill = candidate, group = group), color = "white") +
coord_fixed(1.3) +
guides(fill=FALSE)
county_winner$county[-which(county_winner$fips %in% county.1$fips)]
## [1] <NA> Honolulu County Chesapeake city
## [4] Richmond city Alexandria city Hawaii County
## [7] Maui County Portsmouth city Roanoke city
## [10] Lynchburg city Kauai County Charlottesville city
## [13] Danville city Harrisonburg city Manassas city
## [16] Petersburg city Salem city Fairfax city
## [19] Fredericksburg city Staunton city Winchester city
## [22] Waynesboro city Hopewell city Colonial Heights city
## [25] Falls Church city Williamsburg city Poquoson city
## [28] Bristol city Radford city Martinsville city
## [31] Manassas Park city Franklin city <NA>
## [34] Lexington city Buena Vista city Covington city
## [37] Galax city Emporia city Norton city
## 1846 Levels: Abbeville County Acadia Parish Accomack County ... Ziebach County
For an initial visualization of the data, we can create scatterplots of selected census variables that we think may be relevant, and see if there appears to be a correlation between any of them at the county level.
head(census)
## # A tibble: 6 x 37
## CensusTract State County TotalPop Men Women Hispanic White Black Native
## <fct> <fct> <fct> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1001020100 Alab… Autau… 1948 940 1008 0.9 87.4 7.7 0.3
## 2 1001020200 Alab… Autau… 2156 1059 1097 0.8 40.4 53.3 0
## 3 1001020300 Alab… Autau… 2968 1364 1604 0 74.5 18.6 0.5
## 4 1001020400 Alab… Autau… 4423 2172 2251 10.5 82.8 3.7 1.6
## 5 1001020500 Alab… Autau… 10763 4922 5841 0.7 68.5 24.8 0
## 6 1001020600 Alab… Autau… 3851 1787 2064 13.1 72.9 11.9 0
## # … with 27 more variables: Asian <dbl>, Pacific <dbl>, Citizen <int>,
## # Income <dbl>, IncomeErr <int>, IncomePerCap <int>,
## # IncomePerCapErr <int>, Poverty <dbl>, ChildPoverty <dbl>,
## # Professional <dbl>, Service <dbl>, Office <dbl>, Construction <dbl>,
## # Production <dbl>, Drive <dbl>, Carpool <dbl>, Transit <dbl>,
## # Walk <dbl>, OtherTransp <dbl>, WorkAtHome <dbl>, MeanCommute <dbl>,
## # Employed <int>, PrivateWork <dbl>, PublicWork <dbl>,
## # SelfEmployed <dbl>, FamilyWork <dbl>, Unemployment <dbl>
plot(census[,c(8,9:14,18,20,37)])
Based on the patterns visible in some of the scatterplots, we want to take a closer look at the relationship between Unemployment and the percentages of Citizens, Native American/Alaskan, Pacific/Native Hawaiian, and Professionals. There is no need to examine, for example, percentage of white people against percentage of professionals, as that scatterplot appears fairly evenly distributed.
censusomit <- na.omit(census)
ggplot(data=censusomit,aes(x=Unemployment, y=Citizen*100/TotalPop))+
scale_shape_discrete(solid=F) +
geom_point(aes(color='Citizen', shape='Other Demographics'),alpha=0.5)+
geom_point(data=censusomit,aes(x=Unemployment, y=Professional, color='Professional',shape='Other Demographics'),alpha=0.5)+
geom_point(data=censusomit,aes(x=Unemployment, y=Native, color='Native',shape='Ethnicity'),alpha=0.5)+
geom_point(data=censusomit,aes(x=Unemployment, y=Pacific, color='Pacific',shape='Ethnicity'),alpha=0.5)+
ggtitle("Percentage of Citizens, Native American/Alaskan, Pacific/Native Hawaiian, and \n Professionals against the Unemployment rate of each county") +
ylab('Percentage %')
This graph indicates a strong link between the percentage of professionals in an area and its unemployment rate, which suggests that an area's occupational makeup strongly influences job opportunity. In general, areas with either high or low (not around 50%) percentages of Native American/Alaskan people tend to have greater unemployment rates. This is an interesting trend that begs the question of why the phenomenon might occur. It could simply be an artifact of where Native American and Alaskan people live, or there could be some real significance behind it; the relationship would need to be further scrutinized and tested before drawing any such conclusion.
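As a first formal check on these apparent relationships, a simple correlation test is one option. This sketch is illustrative, not part of the assignment; it assumes the censusomit data frame built above is in scope:

```r
# Pearson correlation between the Professional share and the Unemployment
# rate across tracts, with a p-value for the null of zero correlation
cor.test(censusomit$Professional, censusomit$Unemployment)
```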
census.del = census %>%
na.omit() %>%
mutate(Men = (Men/TotalPop)*100, Employed = (Employed/TotalPop)*100, Citizen=(Citizen/TotalPop)*100, Minority=Hispanic+Black+Native+Asian+Pacific) %>%
select(-c(Walk,PublicWork, Construction, Women,Hispanic,Black,Native, Asian,Pacific))
census.subct = census.del[,c(1:6,29,7:28)] %>% #reorder to have the races next to each other like before, instead of having minority at the end
group_by(State,County)%>%
add_tally(TotalPop) %>%
mutate(CountyTotal = n) %>% #to permanently rename
select(-n) %>% #remove the unnecessary n column since it duplicates CountyTotal
mutate(weight=TotalPop/CountyTotal)
census.ct = census.subct %>%
summarise_at(vars(Men:CountyTotal),funs(weighted.mean(.,weight)))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
##
## # Before:
## funs(name = f(.)
##
## # After:
## list(name = ~f(.))
## This warning is displayed once per session.
census.ct <- data.frame(census.ct)
head(census.ct)
## State County Men White Minority Citizen Income IncomeErr
## 1 Alabama Autauga 48.43266 75.78823 22.53687 73.74912 51696.29 7771.009
## 2 Alabama Baldwin 48.84866 83.10262 15.21426 75.69406 51074.36 8745.050
## 3 Alabama Barbour 53.82816 46.23159 51.94382 76.91222 32959.30 6031.065
## 4 Alabama Bibb 53.41090 74.49989 24.16597 77.39781 38886.63 5662.358
## 5 Alabama Blount 49.40565 87.85385 10.59474 73.37550 46237.97 8695.786
## 6 Alabama Bullock 53.00618 22.19918 76.53587 75.45420 33292.69 9000.345
## IncomePerCap IncomePerCapErr Poverty ChildPoverty Professional Service
## 1 24974.50 3433.674 12.91231 18.70758 32.79097 17.17044
## 2 27316.84 3803.718 13.42423 19.48431 32.72994 17.95092
## 3 16824.22 2430.189 26.50563 43.55962 26.12404 16.46343
## 4 18430.99 3073.599 16.60375 27.19708 21.59010 17.95545
## 5 20532.27 2052.055 16.72152 26.85738 28.52930 13.94252
## 6 17579.57 3110.645 24.50260 37.29116 19.55253 14.92420
## Office Production Drive Carpool Transit OtherTransp WorkAtHome
## 1 24.28243 17.15713 87.50624 8.781235 0.09525905 1.3059687 1.8356531
## 2 27.10439 11.32186 84.59861 8.959078 0.12662092 1.4438000 3.8504774
## 3 23.27878 23.31741 83.33021 11.056609 0.49540324 1.6217251 1.5019456
## 4 17.46731 23.74415 83.43488 13.153641 0.50313661 1.5620952 0.7314679
## 5 23.83692 20.10413 84.85031 11.279222 0.36263213 0.4199411 2.2654133
## 6 20.17051 25.73547 74.77277 14.839127 0.77321596 1.8238247 3.0998783
## MeanCommute Employed PrivateWork SelfEmployed FamilyWork Unemployment
## 1 26.50016 43.43637 73.73649 5.433254 0.00000000 7.733726
## 2 26.32218 44.05113 81.28266 5.909353 0.36332686 7.589820
## 3 24.51828 31.92113 71.59426 7.149837 0.08977425 17.525557
## 4 28.71439 36.69262 76.74385 6.637936 0.39415148 8.163104
## 5 34.84489 38.44914 81.82671 4.228716 0.35649281 7.699640
## 6 28.63106 36.19592 79.09065 5.273684 0.00000000 17.890026
## CountyTotal
## 1 55221
## 2 195121
## 3 26932
## 4 22604
## 5 57710
## 6 10678
ct.out=prcomp(census.ct[,3:28], scale=TRUE)
ct.pc = ct.out$rotation[,1:2] #creating the matrix
head(ct.pc[order(-1*abs(ct.pc[,1])),]) #in order of most prominent loadings PC1
## PC1 PC2
## IncomePerCap 0.3530767 0.138901672
## ChildPoverty -0.3421530 0.040582171
## Poverty -0.3405832 0.056023764
## Employed 0.3274294 0.002977684
## Income 0.3225866 0.207405608
## Unemployment -0.2876314 0.158871955
head(ct.pc[order(-1*abs(ct.pc[,2])),]) #in order of most prominent loadings PC2
## PC1 PC2
## IncomeErr 0.17382461 0.3145022
## SelfEmployed 0.09389830 -0.3087990
## CountyTotal 0.06247436 0.2931957
## White 0.21769909 -0.2926768
## Minority -0.22128517 0.2889832
## Transit 0.07653595 0.2777578
subct.out=prcomp(census.subct[,4:31], scale=TRUE)
subct.pc = subct.out$rotation[,1:2] #creating the matrix
head(subct.pc[order(-1*abs(subct.pc[,1])),]) #in order of most prominent loadings PC1
## PC1 PC2
## IncomePerCap 0.3181199 -0.16790832
## Professional 0.3064366 -0.14164991
## Poverty -0.3046886 -0.05142781
## Income 0.3025102 -0.15564296
## ChildPoverty -0.2978867 -0.02723482
## Service -0.2688299 -0.05762976
head(subct.pc[order(-1*abs(subct.pc[,2])),]) #in order of most prominent loadings PC2
## PC1 PC2
## Transit -0.05730847 -0.3937091
## Drive 0.07893438 0.3772060
## White 0.24042491 0.3085397
## Minority -0.24202363 -0.3054452
## CountyTotal -0.02151015 -0.2838430
## MeanCommute 0.01001204 -0.2789246
The most significant loadings for the county level data are IncomePerCap (Income per capita) for PC1 and IncomeErr (Income Error) for PC2. The most significant loadings for the sub-county level data are IncomePerCap (Income per capita) for PC1 and Transit (Percentage of population that commutes by public transportation) for PC2.
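As a quick sanity check on how much of the overall variance these two leading components actually capture, one can inspect prcomp's summary (assuming the ct.out object fit above is in scope):

```r
# Rows: proportion of variance and cumulative proportion; columns: PC1, PC2
summary(ct.out)$importance[2:3, 1:2]
```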
set.seed(1)
scalecensus.ct = scale(census.ct[,3:28])
ctdist = dist(scalecensus.ct, method = "euclidean")
ct.hclust = hclust(ctdist, method = "complete")
ctclust = cutree(ct.hclust, k=10)
table(ctclust)
## ctclust
## 1 2 3 4 5 6 7 8 9 10
## 2632 501 6 7 5 1 11 13 38 4
ct.clust = census.ct[,1:28] %>% mutate(Cluster = as.factor(ctclust))
ctsanmateo = ct.clust %>% filter(County == 'San Mateo')
ctsanmateo = ct.clust %>% filter(Cluster == ctsanmateo$Cluster)
head(ctsanmateo) #counties placed in the same cluster as San Mateo
## State County Men White Minority Citizen
## 1 Alabama Lee 49.28856 67.43065 30.82046 73.44046
## 2 Alabama Limestone 50.25168 77.41039 20.40782 74.41698
## 3 Alabama Madison 48.90399 65.61261 31.65187 74.70052
## 4 Alabama Shelby 48.70732 78.81874 19.90744 72.15153
## 5 Alabama St. Clair 50.24341 86.28356 12.79208 76.04002
## 6 Alaska Anchorage Municipality 51.19305 60.33492 31.20105 71.18122
## Income IncomeErr IncomePerCap IncomePerCapErr Poverty ChildPoverty
## 1 47915.99 7666.478 25435.25 3747.141 22.341042 24.60349
## 2 52657.32 9155.641 25568.90 3974.550 14.268159 18.62215
## 3 64645.12 9039.880 32131.20 4082.470 14.013142 19.58885
## 4 73433.95 10859.706 33493.52 4160.205 8.569611 11.07811
## 5 51124.35 9512.729 24167.63 3384.393 16.061939 20.99730
## 6 83256.53 10549.157 36919.88 4854.646 8.189246 12.33076
## Professional Service Office Production Drive Carpool Transit
## 1 39.31037 15.64176 23.00577 13.540929 85.38979 8.122526 0.8573492
## 2 33.81955 15.22677 22.32613 16.000854 87.47610 8.823741 0.2739148
## 3 44.00454 15.80609 23.31452 10.053710 86.72491 7.204684 0.3133698
## 4 42.38809 14.42272 27.10418 8.617651 85.42398 8.148431 0.1669823
## 5 28.43117 15.30582 27.23987 16.314174 85.90453 8.689701 0.1776391
## 6 39.08969 18.08762 24.15257 9.171199 75.21122 11.977503 1.9997439
## OtherTransp WorkAtHome MeanCommute Employed PrivateWork SelfEmployed
## 1 0.9750383 3.126988 22.01869 46.63387 75.99504 4.084038
## 2 0.6308215 2.137847 26.34118 42.41203 77.81109 5.510307
## 3 1.5754273 2.981286 20.80628 47.26762 75.72323 4.741186
## 4 0.8172771 4.838638 28.67166 49.79266 82.40741 4.807864
## 5 1.1006930 3.604652 30.19784 43.06461 81.33290 6.012870
## 6 3.9567616 3.754034 19.64740 50.93662 72.22658 5.586221
## FamilyWork Unemployment CountyTotal Cluster
## 1 0.1234373 7.175399 142879 2
## 2 0.1479320 7.722984 88805 2
## 3 0.1709287 8.651256 346438 2
## 4 0.1417157 5.500144 203530 2
## 5 0.1113948 8.354981 85864 2
## 6 0.1106393 6.701172 299107 2
#using pca
set.seed(1)
ct.pc2 = data.frame(ct.out$x[,1:5])
ct.pc2 = scale(ct.pc2)
ctpcdist = dist(ct.pc2, method = "euclidean")
ctpc.hclust = hclust(ctpcdist, method = "complete")
ctpcclust = cutree(ctpc.hclust,k=10)
table(ctpcclust)
## ctpcclust
## 1 2 3 4 5 6 7 8 9 10
## 2441 525 97 6 8 31 5 18 7 80
ctpc.clust = census.ct[,1:28] %>% mutate(Cluster = ctpcclust)
ctpcsanmateo = ctpc.clust %>% filter(County == 'San Mateo')
ctpcsanmateo = ctpc.clust %>% filter(Cluster == ctpcsanmateo$Cluster)
head(ctpcsanmateo)
## State County Men White Minority Citizen Income IncomeErr
## 1 Alabama Autauga 48.43266 75.78823 22.53687 73.74912 51696.29 7771.009
## 2 Alabama Baldwin 48.84866 83.10262 15.21426 75.69406 51074.36 8745.050
## 3 Alabama Blount 49.40565 87.85385 10.59474 73.37550 46237.97 8695.786
## 4 Alabama Butler 46.68370 53.33139 45.48332 76.55006 32516.68 6584.549
## 5 Alabama Calhoun 48.24080 73.02656 24.86455 75.96447 42423.72 6895.153
## 6 Alabama Chambers 47.70680 57.30429 41.68152 77.64899 34216.94 7550.160
## IncomePerCap IncomePerCapErr Poverty ChildPoverty Professional Service
## 1 24974.50 3433.674 12.91231 18.70758 32.79097 17.17044
## 2 27316.84 3803.718 13.42423 19.48431 32.72994 17.95092
## 3 20532.27 2052.055 16.72152 26.85738 28.52930 13.94252
## 4 18389.82 2532.536 25.39623 37.28658 26.83399 17.09993
## 5 21372.60 2758.231 21.04747 31.96249 26.71739 18.00230
## 6 21070.99 3936.265 21.59896 37.19649 23.30296 14.68630
## Office Production Drive Carpool Transit OtherTransp WorkAtHome
## 1 24.28243 17.15713 87.50624 8.781235 0.09525905 1.3059687 1.835653
## 2 27.10439 11.32186 84.59861 8.959078 0.12662092 1.4438000 3.850477
## 3 23.83692 20.10413 84.85031 11.279222 0.36263213 0.4199411 2.265413
## 4 22.22400 23.91135 84.50587 12.616203 0.00000000 0.5309767 1.606397
## 5 24.19005 20.66590 85.10516 9.626817 0.22255924 1.2204263 2.630111
## 6 25.86724 24.79955 84.83892 12.181581 0.18994102 0.3551777 2.080313
## MeanCommute Employed PrivateWork SelfEmployed FamilyWork Unemployment
## 1 26.50016 43.43637 73.73649 5.433254 0.00000000 7.733726
## 2 26.32218 44.05113 81.28266 5.909353 0.36332686 7.589820
## 3 34.84489 38.44914 81.82671 4.228716 0.35649281 7.699640
## 4 24.56754 38.38558 77.95180 6.133266 0.20386656 11.256711
## 5 23.99630 40.63732 74.42687 5.046385 0.10754175 12.387189
## 6 25.17046 40.16843 85.32039 2.755389 0.01137651 9.176519
## CountyTotal Cluster
## 1 55221 1
## 2 195121 1
## 3 57710 1
## 4 20354 1
## 5 116644 1
## 6 34079 1
sanmateocompare = tibble(Data = colnames(census.ct[3:28]), Native = colMeans(ctsanmateo[,3:28]),PCA = colMeans(ctpcsanmateo[,3:28]))
sanmateocompare
## # A tibble: 26 x 3
## Data Native PCA
## <chr> <dbl> <dbl>
## 1 Men 49.9 49.5
## 2 White 80.2 79.1
## 3 Minority 17.6 19.0
## 4 Citizen 74.2 74.9
## 5 Income 63296. 49249.
## 6 IncomeErr 9055. 7244.
## 7 IncomePerCap 31677. 25040.
## 8 IncomePerCapErr 4086. 3196.
## 9 Poverty 11.2 16.1
## 10 ChildPoverty 13.9 21.9
## # … with 16 more rows
In the first iteration of clustering, run on the census.ct data directly, every San Mateo County observation is placed in cluster 2. The PCA-based iteration of the clustering instead places the San Mateo County observations in cluster 1. The discrepancy in which cluster the county is placed could be due to the fact that the first 5 principal components may not account for enough of the variance in the data to cluster the dataset in the same way.
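One way to quantify how much the two clusterings disagree beyond San Mateo alone is to cross-tabulate the two assignments. This is a side check, assuming the ctclust and ctpcclust vectors from the chunks above are in scope:

```r
# Rows: clusters from the raw census.ct features; columns: clusters from the
# first 5 PCs. Large off-diagonal counts indicate disagreement.
table(raw = ctclust, pca = ctpcclust)
```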
All the code below is given within the problem
tmpwinner = county_winner %>% dplyr::ungroup() %>%
mutate(state = state.name[match(state, state.abb)]) %>% ## state abbreviations
mutate_at(vars(state, county), tolower) %>% ## to all lowercase
mutate(county = gsub(" county| columbia| city| parish", "", county)) ## remove suffixes
tmpcensus = census.ct %>% dplyr::ungroup() %>% mutate_at(vars(State, County), tolower)
election.cl = tmpwinner %>%
left_join(tmpcensus, by = c("state"="State", "county"="County")) %>%
na.omit
## saves meta information to attributes
attr(election.cl, "location") = election.cl %>% select(c(county, fips, state, votes, pct))
election.cl = election.cl %>% select(-c(county, fips, state, votes, pct))
set.seed(10)
n = nrow(election.cl)
in.trn= sample.int(n, 0.8*n)
trn.cl = election.cl[ in.trn,]
tst.cl = election.cl[-in.trn,]
set.seed(20)
nfold = 10
folds = sample(cut(1:nrow(trn.cl), breaks=nfold, labels=FALSE))
calc_error_rate = function(predicted.value, true.value){
return(mean(true.value!=predicted.value))
}
records = matrix(NA, nrow=2, ncol=2)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","knn")
head(trn.cl)
## # A tibble: 6 x 28
## candidate total Men White Minority Citizen Income IncomeErr
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Donald T… 1.36e8 48.7 86.9 11.2 76.2 46162. 6628.
## 2 Donald T… 1.36e8 49.6 82.2 15.7 75.3 51039. 7356.
## 3 Donald T… 1.36e8 47.8 49.8 48.8 75.0 28426. 7177.
## 4 Donald T… 1.36e8 57.9 50.1 49.6 70.7 26495. 4094.
## 5 Donald T… 1.36e8 47.7 87.6 11.9 82.5 50909. 12800.
## 6 Donald T… 1.36e8 47.9 65.6 32.6 77.7 33943. 6736.
## # … with 20 more variables: IncomePerCap <dbl>, IncomePerCapErr <dbl>,
## # Poverty <dbl>, ChildPoverty <dbl>, Professional <dbl>, Service <dbl>,
## # Office <dbl>, Production <dbl>, Drive <dbl>, Carpool <dbl>,
## # Transit <dbl>, OtherTransp <dbl>, WorkAtHome <dbl>, MeanCommute <dbl>,
## # Employed <dbl>, PrivateWork <dbl>, SelfEmployed <dbl>,
## # FamilyWork <dbl>, Unemployment <dbl>, CountyTotal <dbl>
trn.cl <- trn.cl %>% select(-total)
tst.cl <- tst.cl %>% select(-total)
tree.winner = tree(candidate ~., data = trn.cl)
summary(tree.winner)
##
## Classification tree:
## tree(formula = candidate ~ ., data = trn.cl)
## Variables actually used in tree construction:
## [1] "Transit" "White" "Unemployment" "Citizen"
## [5] "Professional" "Drive" "Employed" "Production"
## [9] "CountyTotal"
## Number of terminal nodes: 16
## Residual mean deviance: 0.3147 = 767.8 / 2440
## Misclassification error rate: 0.05904 = 145 / 2456
cv.tree.winner = cv.tree(tree.winner, rand=folds, FUN=prune.misclass, K=10)
best.cv = cv.tree.winner$size[which.min(cv.tree.winner$dev)]
best.cv
## [1] 9
winner.prune = prune.misclass(tree.winner, best=best.cv)
draw.tree(tree.winner, cex=0.5,nodeinfo=TRUE) #unpruned
title("Unpruned Tree")
draw.tree(winner.prune, cex=0.5,nodeinfo=TRUE) #pruned
title("Pruned Tree")
tree.pred.trn = predict(winner.prune, trn.cl, type="class")
tree.pred.tst = predict(winner.prune, tst.cl, type="class")
true.winner.trn = trn.cl$candidate
true.winner.tst = tst.cl$candidate
records[1,1] = calc_error_rate(true.winner.trn, tree.pred.trn)
records[1,2] = calc_error_rate(true.winner.tst, tree.pred.tst)
records
## train.error test.error
## tree 0.05944625 0.06840391
## knn NA NA
# do.chunk() for k-fold Cross-validation
do.chunk <- function(chunkid, folddef, Xdat, Ydat, k){ # Function arguments
train = (folddef!=chunkid) # Get training index
Xtr = Xdat[train,] # Get training set by the above index
Ytr = Ydat[train] # Get true labels in training set
Xvl = Xdat[!train,] # Get validation set
Yvl = Ydat[!train] # Get true labels in validation set
predYtr = knn(train=Xtr, test=Xtr, cl=Ytr, k) # Predict training labels
predYvl = knn(train=Xtr, test=Xvl, cl=Ytr, k) # Predict validation labels
data.frame(fold = chunkid, # k folds
train.error = mean(predYtr != Ytr), # Training error for each fold
val.error = mean(predYvl != Yvl)) # Validation error for each fold
}
# Set error.folds (a vector) to save validation errors in future
error.folds = NULL
# Give possible number of nearest neighbours to be considered
allK = 1:50
# Set seed since do.chunk() contains a random component induced by knn()
set.seed(3)
# Loop through different number of neighbors
for (j in allK){
tmp = ldply(1:nfold, do.chunk, # Apply do.chunk() function to each fold
folddef=folds, Xdat=trn.cl[,3:27], Ydat=trn.cl$candidate, k=j)
# Necessary arguments to be passed into do.chunk
tmp$neighbors = j # Keep track of each value of neighbors
error.folds = rbind(error.folds, tmp) # combine results
}
val.error.means = error.folds %>%
# Select all rows of validation errors
select(neighbors, train.error, val.error) %>%
# Group the selected data frame by neighbors
group_by(neighbors) %>%
# Calculate CV error rate for each k
summarise_each(funs(mean), train.error,val.error) %>%
# Remove existing group
ungroup()
minvalerror = val.error.means %>% filter(val.error==min(val.error))
minvalerror
## # A tibble: 1 x 3
## neighbors train.error val.error
## <int> <dbl> <dbl>
## 1 10 0.111 0.117
set.seed(99)
pred.train = knn(train=trn.cl[,3:27], test=trn.cl[,3:27], cl=trn.cl$candidate, k=11)
pred.Test = knn(train=trn.cl[,3:27], test=tst.cl[,3:27], cl=trn.cl$candidate, k=11)
records[2,1] = calc_error_rate(true.winner.trn,pred.train)
records[2,2] = calc_error_rate(true.winner.tst, pred.Test)
records
## train.error test.error
## tree 0.05944625 0.06840391
## knn 0.11156352 0.11889251
head(error.folds)
## fold train.error val.error neighbors
## 1 1 0.0000000000 0.1463415 1
## 2 2 0.0009049774 0.2154472 1
## 3 3 0.0004522840 0.1591837 1
## 4 4 0.0009049774 0.1463415 1
## 5 5 0.0013568521 0.1877551 1
## 6 6 0.0004524887 0.1747967 1
head(val.error.means)
## # A tibble: 6 x 3
## neighbors train.error val.error
## <int> <dbl> <dbl>
## 1 1 0.000588 0.171
## 2 2 0.0851 0.171
## 3 3 0.0873 0.132
## 4 4 0.0995 0.134
## 5 5 0.100 0.126
## 6 6 0.106 0.122
ggplot(val.error.means, aes(x=neighbors, y=val.error)) +
geom_point(aes(color='Validation Error')) +
geom_line(aes(color='Validation Error')) +
geom_point(aes(y=train.error, color='Training Error')) +
geom_line(aes(y=train.error, color='Training Error')) +
labs(title="Number of Neighbours vs Training and Validation Errors",
x="Number of Neighbours", y="Error")
The code below is given for problem 15.
pca.records = matrix(NA, nrow=2, ncol=2)
colnames(pca.records) = c("train.error","test.error")
rownames(pca.records) = c("tree","knn")
My code for problem 15 follows.
trn.cl.covariates <- trn.cl %>% select(-candidate)
trn.clY <- trn.cl$candidate
tst.cl.covariates <- tst.cl %>% select(-candidate)
tst.clY <- tst.cl$candidate
training_pca <- prcomp(trn.cl.covariates, scale = TRUE)
training_pca_var <- training_pca$sdev^2 #pca variance
training_pca_propvar <- training_pca_var / sum(training_pca_var) #proportion of variance
which(cumsum(training_pca_propvar)>= 0.90)[1]
## [1] 14
The minimum number of principal components needed to capture 90 percent of the variance is 14.
plot(cumsum(training_pca_propvar), main = "Total Proportion of Variance Explained", xlab = "Principal Component", ylab = "Proportion of Variance Explained" ,type = "b")
#for training
trn.pc <- data.frame(training_pca$x) #xvals
train.pca <- trn.pc %>% mutate(candidate=trn.cl$candidate)
#for test
test_pca <- prcomp(tst.cl.covariates, scale=TRUE) #separate PCA on the test set, analogous to training_pca above; projecting the test data through the training rotation with predict(training_pca, tst.cl.covariates) would arguably be more principled
tst.pc <- data.frame(test_pca$x) #xvals
test.pca <- tst.pc %>% mutate(candidate=tst.cl$candidate)
#unpruned tree
OGtree <- tree(candidate~., train.pca)
summary(OGtree)
##
## Classification tree:
## tree(formula = candidate ~ ., data = train.pca)
## Variables actually used in tree construction:
## [1] "PC2" "PC1" "PC3" "PC17" "PC18" "PC4" "PC10" "PC6"
## Number of terminal nodes: 15
## Residual mean deviance: 0.4579 = 1118 / 2441
## Misclassification error rate: 0.09406 = 231 / 2456
draw.tree(OGtree, nodeinfo = TRUE, cex = 0.5)
title("Unpruned Tree")
#prune with cross validation and misclassification error
cv.tree.winner2 = cv.tree(OGtree, rand=folds, FUN=prune.misclass, K=10)
best.cv2 = min(cv.tree.winner2$size[which(cv.tree.winner2$dev==min(cv.tree.winner2$dev))])
best.cv2
## [1] 11
#pruned tree
Pruner <- prune.misclass(OGtree,best = best.cv2)
draw.tree(Pruner, nodeinfo = TRUE, cex = 0.5)
title("Pruned Tree")
tr.pcaX <- trn.pc
tr.pcaY <- train.pca$candidate
test.pcaX <- tst.pc
test.pcaY <- test.pca$candidate
#predictions
tree.pred.trn1 = predict(Pruner, tr.pcaX, type="class")
tree.pred.tst1 = predict(Pruner, test.pcaX, type="class")
#errors
pca.records[1,1] = calc_error_rate(tr.pcaY, tree.pred.trn1)
pca.records[1,2] = calc_error_rate(test.pcaY, tree.pred.tst1)
pca.records
## train.error test.error
## tree 0.09405537 0.1026059
## knn NA NA
# Set error.folds (a vector) to save validation errors in future
error.folds = NULL
# Give possible number of nearest neighbours to be considered
allK = 1:50
# Set seed since do.chunk() contains a random component induced by knn()
set.seed(888)
# Loop through different number of neighbors
for (j in allK){
tmp = ldply(1:nfold, do.chunk, folddef=folds, Xdat=tr.pcaX, Ydat= tr.pcaY, k=j)
tmp$neighbors = j
error.folds = rbind(error.folds, tmp)}
errors = melt(error.folds, id.vars=c('fold', 'neighbors'), value.name='error')
# Choose the number of neighbors which minimizes validation error
val.error.means = errors %>%
# Select all rows of validation errors
filter(variable=='val.error') %>%
# Group the selected data frame by neighbors
group_by(neighbors, variable) %>%
# Calculate CV error rate for each k
summarise_each(funs(mean), error) %>%
# Remove existing group
ungroup() %>%
filter(error==min(error))
#same for training
train.error.means <- errors %>%
filter(variable=="train.error") %>%
group_by(neighbors, variable) %>%
summarise_each(funs(mean), error)
# Best number of neighbors
# if there is a tie, pick larger number of neighbors for simpler model
numneighbor = max(val.error.means$neighbors)
numneighbor
## [1] 7
#predictions
pred.YTrain = knn(train=tr.pcaX, test=tr.pcaX, cl=tr.pcaY, k=numneighbor)
pred.YTest = knn(train=tr.pcaX, test=test.pcaX, cl=tr.pcaY, k=numneighbor)
#error rates
pca.records[2,1] = calc_error_rate(tr.pcaY,pred.YTrain)
pca.records[2,2] = calc_error_rate(test.pcaY, pred.YTest)
pca.records
## train.error test.error
## tree 0.09405537 0.1026059
## knn 0.05618893 0.1058632
head(combined_counties2) # map data that we will join with the tree predictions to show whether each county was predicted correctly
## long lat group order region subregion fips county
## 1 -86.50517 32.34920 1 1 alabama autauga 1001 Autauga County
## 2 -86.53382 32.35493 1 2 alabama autauga 1001 Autauga County
## 3 -86.54527 32.36639 1 3 alabama autauga 1001 Autauga County
## 4 -86.55673 32.37785 1 4 alabama autauga 1001 Autauga County
## 5 -86.57966 32.38357 1 5 alabama autauga 1001 Autauga County
## 6 -86.59111 32.37785 1 6 alabama autauga 1001 Autauga County
## candidate state votes total pct
## 1 Donald Trump AL 18172 135691978 0.000133921
## 2 Donald Trump AL 18172 135691978 0.000133921
## 3 Donald Trump AL 18172 135691978 0.000133921
## 4 Donald Trump AL 18172 135691978 0.000133921
## 5 Donald Trump AL 18172 135691978 0.000133921
## 6 Donald Trump AL 18172 135691978 0.000133921
treecounties = predict(winner.prune, election.cl, type="class")
tree.counties=tibble(pred=treecounties, fips=attributes(election.cl)$location$fips)
head(tree.counties)
## # A tibble: 6 x 2
## pred fips
## <fct> <fct>
## 1 Hillary Clinton 6037
## 2 Hillary Clinton 17031
## 3 Hillary Clinton 4013
## 4 Hillary Clinton 48201
## 5 Hillary Clinton 6073
## 6 Hillary Clinton 6059
tree.countiesmap = combined_counties2
tree.countiesmap = left_join(tree.countiesmap, tree.counties, by = 'fips')
## Warning: Column `fips` joining character vector and factor, coercing into
## character vector
tree.countiesmap = tree.countiesmap %>% mutate(Correct = candidate==pred)
ggplot(data = tree.countiesmap) +
geom_polygon(aes(x = long, y = lat, fill = Correct, group = group), color = "white") +
coord_fixed(1.3) +
guides(fill=FALSE) # suppress the legend
This map uses our decision tree from #13 to show in blue the counties where the winner was correctly predicted, and in red the counties where the winner was predicted incorrectly.
cluster.winner = tibble(Winner = election.cl$candidate,County = attributes(election.cl)$location$county,State = attributes(election.cl)$location$state )
cluster.winner<- na.omit(cluster.winner)
ct.clust = ctpc.clust %>% na.omit() %>% ungroup() %>% mutate_at(vars(State,County), tolower) %>% ## convert to lowercase
mutate(County = gsub(" county| columbia| city| parish", "", County)) ## remove suffixes
ct.clust %>% filter(Cluster=='7')
## State County Men White Minority Citizen Income
## 1 california los angeles 49.20333 26.91140 70.65879 60.13610 61385.25
## 2 florida miami-dade 48.43111 15.00706 84.21667 57.54322 49421.74
## 3 illinois cook 48.45616 43.08770 55.23620 66.83787 59825.22
## 4 texas dallas 49.33093 31.38022 66.77038 57.62629 55986.93
## 5 texas harris 49.73236 31.59201 66.88452 57.20801 61649.70
## IncomeErr IncomePerCap IncomePerCapErr Poverty ChildPoverty
## 1 10571.774 28389.55 4001.736 18.23464 23.51242
## 2 9375.868 23899.19 3962.068 20.47254 26.92236
## 3 10306.387 31015.38 4414.122 17.11807 22.47922
## 4 9010.894 27363.26 3913.568 19.35673 27.45469
## 5 10103.385 28914.86 4116.964 17.98876 25.41255
## Professional Service Office Production Drive Carpool Transit
## 1 34.00558 19.50154 24.67255 13.66206 72.90224 10.269109 6.952143
## 2 30.41751 21.44393 28.06933 10.60249 76.69790 9.400059 5.721978
## 3 35.89117 19.25589 24.56791 13.92867 62.69589 8.941790 18.211843
## 4 30.98924 18.88051 24.44867 13.61059 77.91109 11.544363 3.095128
## 5 32.85313 17.94925 23.48078 13.56786 78.98413 11.350972 2.948860
## OtherTransp WorkAtHome MeanCommute Employed PrivateWork SelfEmployed
## 1 2.220029 4.911887 30.14618 46.29416 79.12180 9.191318
## 2 1.859301 4.275635 30.04616 45.76468 81.70917 7.748119
## 3 2.071736 4.062571 32.72339 47.05199 83.48544 4.483416
## 4 2.162039 3.855033 26.88649 47.97452 83.88569 6.418356
## 5 2.018704 3.242627 28.38802 47.81813 83.31643 6.387799
## FamilyWork Unemployment CountyTotal Cluster
## 1 0.1575896 10.134648 9995004 7
## 2 0.1576626 10.238894 2627074 7
## 3 0.1258887 11.424016 5235764 7
## 4 0.1524062 7.884831 2479481 7
## 5 0.1503052 7.741240 4342911 7
cluster.dem = left_join(cluster.winner, ct.clust[,c(1,2,29)], by=c('County','State')) %>% filter(Winner=='Hillary Clinton')%>%dplyr::group_by(Cluster) %>% dplyr::summarize(Dem=length(Winner)) %>% add_row(Cluster=c(5,7,10), Dem=0)
cluster.rep = left_join(cluster.winner, ct.clust[,c(1,2,29)], by=c('County','State')) %>% filter(Winner=='Donald Trump')%>%dplyr::group_by(Cluster) %>% dplyr::summarize(Rep=length(Winner)) %>% add_row(Cluster=c(7,9), Rep=0)
cluster.dem
## # A tibble: 11 x 2
## Cluster Dem
## <dbl> <dbl>
## 1 1 375
## 2 2 53
## 3 3 17
## 4 6 2
## 5 7 5
## 6 8 14
## 7 9 6
## 8 10 1
## 9 5 0
## 10 7 0
## 11 10 0
cluster.rep
## # A tibble: 9 x 2
## Cluster Rep
## <dbl> <dbl>
## 1 1 2037
## 2 2 457
## 3 3 1
## 4 4 4
## 5 6 28
## 6 8 3
## 7 10 79
## 8 7 0
## 9 9 0
cluster.perc = left_join(cluster.dem, cluster.rep, by='Cluster') %>% mutate(rep.pc = Rep*100/(Dem+Rep),dem.pc = Dem*100/(Dem+Rep), Total = Dem+Rep)%>% select(Cluster, rep.pc, dem.pc)
cluster.perc = melt(cluster.perc, id.var="Cluster")
cluster.perc <- na.omit(cluster.perc)
ggplot(data=cluster.perc, aes(x=Cluster, y=value,fill=variable))+
geom_bar(stat='identity',position='stack')
#plot clusters on graph of rep and dem winner percentages, size of cluster = number of counties within cluster
county_perc <- election %>%
dplyr::group_by(county) %>%
dplyr::mutate(ctvotes = sum(votes)) %>%
dplyr::group_by(candidate)
## Warning: Factor `county` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `county` contains implicit NA, consider using
## `forcats::fct_explicit_na`
demperc = county_perc %>%
dplyr::filter(candidate=='Hillary Clinton') %>%
dplyr::mutate(demvotes = votes/ctvotes)
repperc = county_perc %>%
dplyr::filter(candidate=='Donald Trump') %>%
dplyr::mutate(repvotes = votes/ctvotes)
county_perc = merge(repperc, demperc, by='county')
county_perc = dplyr::select(county_perc,-c(fips.x,candidate.x,state.x,votes.x,ctvotes.x,fips.y,candidate.y,state.y,votes.y,ctvotes.y))
poverty = census.ct %>% ungroup(State) %>% select(County,Poverty) %>% rename(c('County'='county'))
head(county_perc)
## county repvotes demvotes
## 1 Abbeville County 0.6276566 0.34719258
## 2 Acadia Parish 0.7726460 0.20584906
## 3 Accomack County 0.5444339 0.42752934
## 4 Ada County 0.4792410 0.38684531
## 5 Adair County 0.2314963 0.04614580
## 6 Adair County 0.2314963 0.03951866
head(poverty)
## county Poverty
## 1 Autauga 12.91231
## 2 Baldwin 13.42423
## 3 Barbour 26.50563
## 4 Bibb 16.60375
## 5 Blount 16.72152
## 6 Bullock 24.50260
povgraph = merge(county_perc,poverty,by='county')
povgraph
## county repvotes demvotes Poverty
## 1 Alexandria city 0.1782336 0.7679676 8.318699
## 2 Baltimore city 0.1065602 0.8568487 23.543020
## 3 Bristol city 0.7011610 0.2630070 20.586898
## 4 Buena Vista city 0.5975763 0.2895947 28.400000
## 5 Carson City 0.5246642 0.3841541 16.606984
## 6 Charlottesville city 0.1329799 0.8042140 26.726417
## 7 Chesapeake city 0.4826659 0.4699846 9.626140
## 8 Colonial Heights city 0.6741426 0.2808829 11.384550
## 9 Covington city 0.5694386 0.3858168 25.420642
## 10 Danville city 0.3870780 0.5861557 23.890419
## 11 District of Columbia 0.0417535 0.9281727 17.985582
## 12 Emporia city 0.3338976 0.6474820 34.771580
## 13 Fairfax city 0.3119838 0.6208495 7.011226
## 14 Falls Church city 0.1724407 0.7578797 2.739134
## 15 Franklin city 0.3514717 0.6230522 21.044709
## 16 Fredericksburg city 0.3353036 0.6006627 16.534144
## 17 Galax city 0.6780880 0.2880711 26.712420
## 18 Hampton city 0.2892224 0.6674314 15.034269
## 19 Harrisonburg city 0.3508320 0.5721329 30.895483
## 20 Hopewell city 0.4331104 0.5266444 19.092334
## 21 Lexington city 0.3129085 0.6184641 20.200000
## 22 Lynchburg city 0.5085551 0.4183376 24.478488
## 23 Manassas city 0.3901816 0.5520745 9.696282
## 24 Manassas Park city 0.3332052 0.6160354 7.954758
## 25 Martinsville city 0.3657250 0.6012594 23.456408
## 26 Newport News city 0.3389452 0.6071148 15.931351
## 27 Norfolk city 0.2601328 0.6882680 20.436758
## 28 Norton city 0.7012363 0.2630495 24.000000
## 29 Petersburg city 0.1056348 0.8751456 27.855932
## 30 Poquoson city 0.7152690 0.2248911 4.161605
## 31 Portsmouth city 0.2972678 0.6620743 18.043051
## 32 Radford city 0.4371168 0.4846727 43.979094
## 33 Richmond city 0.1516916 0.7911113 26.021640
## 34 Roanoke city 0.3771261 0.5683029 21.181484
## 35 Salem city 0.5962046 0.3466997 11.301574
## 36 St. Louis city 0.1591638 0.7963922 27.248264
## 37 Staunton city 0.4602349 0.4781673 17.608792
## 38 Suffolk city 0.4183259 0.5408545 11.541498
## 39 Virginia Beach city 0.4875003 0.4518053 8.284522
## 40 Waynesboro city 0.5264254 0.4127193 18.192662
## 41 Williamsburg city 0.2551020 0.6899019 21.606615
## 42 Winchester city 0.4526982 0.4880446 16.028795
ggplot(data=povgraph, aes(x=demvotes, y=repvotes, size=Poverty))+
geom_point(alpha=0.2)
qplot(x=demvotes, y=repvotes,data=povgraph,geom='point')
This bar chart shows, for each cluster, the percentage of counties in which Clinton won the majority of votes (blue) and the percentage in which Trump won the majority (red). Each bar represents one of the clusters found in #13. Just by looking at the chart we can see that within each of the clusters there is a clear candidate who won the majority of counties. This indicates that the demographics in our census (the variables used to find the clusters) could be good predictors of which candidate will win in a particular area. To see how effective this is in practice, we can look at the results of classification methods that use census data to predict the winner of each county. In our decision trees, the splitting variables chosen are the percentage of people who commute using public transportation, the percentage of white people, median household income, and the unemployment rate. This supports the media narrative that Donald Trump, as a populist and political outsider, was more popular in white-majority, low-income or disadvantaged regions where people may have felt left behind by the previous Democratic administration. The transportation variable does not immediately seem like an obvious driver of voter behaviour, but it is probably linked to how urban or rural a county is, with more metropolitan areas (often thought of as Democratic areas) having a higher proportion of people using public transport. A variable that more directly captures the divide between rural and urban populations could certainly take our predictions further. The maps showing the winner of each state and county also seem to show a pattern of Democrats on the West and East Coasts but Republicans in the South, so predictions that take into account geographical differences not included in the census data would be an improvement.
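One way to operationalize the rural/urban idea (not done in our analysis) would be an indicator based on population density. A hypothetical sketch with made-up numbers; the `land_area` column is an assumed extra variable that our census data does not actually contain:

```r
# Hypothetical: classify counties as urban or rural by population density.
# 'land_area' (square miles) is an assumed extra variable, not in our data.
counties <- data.frame(pop = c(9995004, 2627074, 24000, 5100),
                       land_area = c(4058, 1898, 2900, 1200))
counties$density <- counties$pop / counties$land_area
counties$urban <- counties$density > 500  # arbitrary cutoff
counties$urban  # TRUE TRUE FALSE FALSE
```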
Our predictions would be better, and more useful for the future, if they took into account data from polls and previous elections. Right now our model is predicting 2016 winners from 2016 election data, so it is heavily biased towards the 2016 candidates and might not be accurate for predicting future elections. Also, looking at opinion polls on the candidates per region could have helped explain the areas where our classifications were not accurate.
Since this project does not include any regression-based approach, it seems logical to try one. Modeling the probability of a candidate winning in a particular state seemed like a very interesting thing to see.
To get the dataframes needed to run a GLM, a lot of preprocessing must take place to create variables that make sense.
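The core trick in the preprocessing below, shown in miniature: per-tract percentages are converted back to counts, summed to a coarser level, and then renormalized. This gives a population-weighted average instead of a misleading unweighted mean of percentages. A toy example with invented numbers:

```r
# Toy example of the percentage -> count -> sum -> percentage aggregation.
tracts <- data.frame(state = c("A", "A", "B"),
                     pop = c(1000, 3000, 2000),
                     pct_white = c(50, 90, 40))
tracts$white_n <- tracts$pct_white * tracts$pop / 100  # counts per tract
agg <- aggregate(cbind(pop, white_n) ~ state, data = tracts, FUN = sum)
agg$pct_white <- 100 * agg$white_n / agg$pop
agg$pct_white  # A: 80 (not the unweighted mean 70), B: 40
```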
censusomit <- na.omit(census)
census.partly = censusomit %>%
mutate(Minority=Hispanic+Black+Native+Asian+Pacific) %>%
select(-c(Walk,PublicWork, Construction, Women,Hispanic,Black,Native, Asian,Pacific))
census.partly = census.partly[,c(1:6,29,7:28)]
census.partly = census.partly[,-c(11,12,21,22,25,28)]
census.partly.1 = census.partly %>%
mutate(White = White * TotalPop/100,Minority = Minority * TotalPop/100,Income = Income * TotalPop,IncomeErr = IncomeErr * TotalPop, Poverty = Poverty * TotalPop/100,ChildPoverty = ChildPoverty* TotalPop/100,Professional = Professional * TotalPop/100,Service = Service* TotalPop/100, Office= Office * TotalPop/100, Production = Production * TotalPop/100,Drive =Drive * TotalPop/100, Carpool =Carpool * TotalPop/100,WorkAtHome = WorkAtHome * TotalPop/100,MeanCommute = MeanCommute * TotalPop, PrivateWork =PrivateWork * TotalPop/100,SelfEmployed = SelfEmployed* TotalPop/100,Unemployment = Unemployment * TotalPop/100)
head(census.partly.1) #this was a check to make sure the data frame was structured correctly
## # A tibble: 6 x 23
## CensusTract State County TotalPop Men White Minority Citizen Income
## <fct> <fct> <fct> <int> <int> <dbl> <dbl> <int> <dbl>
## 1 1001020100 Alab… Autau… 1948 940 1703. 185. 1503 1.20e8
## 2 1001020200 Alab… Autau… 2156 1059 871. 1216. 1662 6.96e7
## 3 1001020300 Alab… Autau… 2968 1364 2211. 617. 2335 1.33e8
## 4 1001020400 Alab… Autau… 4423 2172 3662. 699. 3306 2.40e8
## 5 1001020500 Alab… Autau… 10763 4922 7373. 3154. 7666 5.59e8
## 6 1001020600 Alab… Autau… 3851 1787 2807. 963. 2642 2.43e8
## # … with 14 more variables: IncomeErr <int>, Poverty <dbl>,
## # ChildPoverty <dbl>, Professional <dbl>, Service <dbl>, Office <dbl>,
## # Production <dbl>, Drive <dbl>, Carpool <dbl>, WorkAtHome <dbl>,
## # MeanCommute <dbl>, PrivateWork <dbl>, SelfEmployed <dbl>,
## # Unemployment <dbl>
census.clean <- census.partly.1[,-c(1:3)]
state.agg <- aggregate(census.clean, by = list(censusomit$State), FUN = sum)
head(state.agg)
## Group.1 TotalPop Men White Minority Citizen Income
## 1 Alabama 4821879 2336973 3198032.7 1544679.3 3612759 2.273719e+11
## 2 Alaska 729562 381155 454217.5 220428.0 520405 5.448047e+10
## 3 Arizona 6531748 3244624 3658444.5 2734872.4 4422184 3.587520e+11
## 4 Arkansas 2956316 1450077 2176308.3 722219.9 2162204 1.301689e+11
## 5 California 38221472 18933999 14798574.8 22274252.5 24104414 2.595552e+12
## 6 Colorado 5244194 2624207 3623511.6 1490037.2 3719828 3.480935e+11
## IncomeErr Poverty ChildPoverty Professional Service Office
## 1 37310400288 914060.42 1301940.51 1543899.4 830325.2 1156555.4
## 2 6799683284 74375.72 98642.74 261295.7 130172.9 167417.2
## 3 58462248048 1212375.16 1594641.64 2192625.6 1343046.7 1694348.6
## 4 20801553400 575913.89 802528.03 917514.9 517308.6 699590.6
## 5 416929902916 6254156.77 7866421.30 13626162.8 7334918.1 9069943.0
## 6 49586490679 673457.25 863188.32 2068097.1 932978.0 1248202.1
## Production Drive Carpool WorkAtHome MeanCommute PrivateWork
## 1 802936.21 4104159.5 454296.11 130845.19 118130870 3770505.3
## 2 78711.96 487711.8 91294.51 32789.69 14039764 487312.2
## 3 659793.84 4979501.5 742567.41 356571.13 163208691 5123172.6
## 4 498092.89 2431565.6 324384.01 94065.50 64205732 2276148.6
## 5 4446300.88 28146054.4 4268716.71 1976834.91 1071163291 29653036.4
## 6 483155.31 3965575.9 510223.01 343853.44 130472067 4153410.8
## SelfEmployed Unemployment
## 1 254065.30 466281.76
## 2 45360.82 63043.47
## 3 392744.37 609359.38
## 4 188110.76 234137.31
## 5 3133745.01 3882719.54
## 6 335755.81 372650.14
state.census = state.agg %>%
mutate(Men = Men/TotalPop,White = White/TotalPop,Minority = Minority/TotalPop,Citizen = Citizen/TotalPop,Income = Income/TotalPop,IncomeErr = IncomeErr/TotalPop,Poverty = Poverty/TotalPop,ChildPoverty = ChildPoverty/TotalPop,Professional = Professional/TotalPop,Service = Service/TotalPop, Office= Office/TotalPop, Production = Production/TotalPop,Drive =Drive/TotalPop, Carpool =Carpool/TotalPop,WorkAtHome = WorkAtHome /TotalPop,MeanCommute = MeanCommute/TotalPop, PrivateWork =PrivateWork/TotalPop,SelfEmployed = SelfEmployed/TotalPop,Unemployment = Unemployment/TotalPop)
colnames(state.census)[colnames(state.census)=="Group.1"] <- "state"
state.census <- state.census[-c(40),] #get rid of Puerto Rico
head(state.census)
## state TotalPop Men White Minority Citizen Income
## 1 Alabama 4821879 0.4846602 0.6632337 0.3203480 0.7492430 47154.21
## 2 Alaska 729562 0.5224436 0.6225894 0.3021375 0.7133115 74675.58
## 3 Arizona 6531748 0.4967467 0.5601019 0.4187045 0.6770292 54924.35
## 4 Arkansas 2956316 0.4905014 0.7361555 0.2442973 0.7313846 44030.77
## 5 California 38221472 0.4953760 0.3871796 0.5827680 0.6306511 67908.21
## 6 Colorado 5244194 0.5004023 0.6909568 0.2841308 0.7093231 66376.93
## IncomeErr Poverty ChildPoverty Professional Service Office
## 1 7737.731 0.1895652 0.2700069 0.3201863 0.1721995 0.2398557
## 2 9320.227 0.1019457 0.1352082 0.3581542 0.1784260 0.2294763
## 3 8950.475 0.1856127 0.2441370 0.3356874 0.2056183 0.2594020
## 4 7036.309 0.1948080 0.2714622 0.3103575 0.1749842 0.2366427
## 5 10908.264 0.1636294 0.2058116 0.3565054 0.1919057 0.2372997
## 6 9455.503 0.1284196 0.1645989 0.3943594 0.1779068 0.2380160
## Production Drive Carpool WorkAtHome MeanCommute PrivateWork
## 1 0.16651936 0.8511536 0.09421558 0.02713573 24.49893 0.7819577
## 2 0.10788934 0.6684995 0.12513605 0.04494435 19.24410 0.6679517
## 3 0.10101336 0.7623536 0.11368586 0.05459046 24.98699 0.7843494
## 4 0.16848432 0.8224985 0.10972576 0.03181849 21.71816 0.7699274
## 5 0.11632992 0.7363938 0.11168373 0.05172053 28.02517 0.7758214
## 6 0.09213147 0.7561841 0.09729293 0.06556841 24.87934 0.7920018
## SelfEmployed Unemployment
## 1 0.05269010 0.09670126
## 2 0.06217542 0.08641277
## 3 0.06012852 0.09329193
## 4 0.06363013 0.07919901
## 5 0.08198912 0.10158477
## 6 0.06402429 0.07105956
Winner <- c("Trump","Trump","Trump","Trump", "Clinton","Clinton","Clinton","Clinton","Clinton","Trump","Trump","Clinton","Trump","Clinton","Trump","Trump","Trump","Trump","Trump","Clinton","Clinton","Clinton","Trump","Clinton" ,"Trump","Trump","Trump","Trump","Clinton","Clinton","Clinton","Clinton","Clinton","Trump","Trump","Trump","Trump","Clinton","Trump","Clinton","Trump","Trump","Trump","Trump","Trump","Clinton","Clinton","Clinton","Trump","Trump","Trump")
Winner <- as.factor(Winner)
state.census <- cbind(Winner,state.census)
head(state.census)
## Winner state TotalPop Men White Minority Citizen
## 1 Trump Alabama 4821879 0.4846602 0.6632337 0.3203480 0.7492430
## 2 Trump Alaska 729562 0.5224436 0.6225894 0.3021375 0.7133115
## 3 Trump Arizona 6531748 0.4967467 0.5601019 0.4187045 0.6770292
## 4 Trump Arkansas 2956316 0.4905014 0.7361555 0.2442973 0.7313846
## 5 Clinton California 38221472 0.4953760 0.3871796 0.5827680 0.6306511
## 6 Clinton Colorado 5244194 0.5004023 0.6909568 0.2841308 0.7093231
## Income IncomeErr Poverty ChildPoverty Professional Service
## 1 47154.21 7737.731 0.1895652 0.2700069 0.3201863 0.1721995
## 2 74675.58 9320.227 0.1019457 0.1352082 0.3581542 0.1784260
## 3 54924.35 8950.475 0.1856127 0.2441370 0.3356874 0.2056183
## 4 44030.77 7036.309 0.1948080 0.2714622 0.3103575 0.1749842
## 5 67908.21 10908.264 0.1636294 0.2058116 0.3565054 0.1919057
## 6 66376.93 9455.503 0.1284196 0.1645989 0.3943594 0.1779068
## Office Production Drive Carpool WorkAtHome MeanCommute
## 1 0.2398557 0.16651936 0.8511536 0.09421558 0.02713573 24.49893
## 2 0.2294763 0.10788934 0.6684995 0.12513605 0.04494435 19.24410
## 3 0.2594020 0.10101336 0.7623536 0.11368586 0.05459046 24.98699
## 4 0.2366427 0.16848432 0.8224985 0.10972576 0.03181849 21.71816
## 5 0.2372997 0.11632992 0.7363938 0.11168373 0.05172053 28.02517
## 6 0.2380160 0.09213147 0.7561841 0.09729293 0.06556841 24.87934
## PrivateWork SelfEmployed Unemployment
## 1 0.7819577 0.05269010 0.09670126
## 2 0.6679517 0.06217542 0.08641277
## 3 0.7843494 0.06012852 0.09329193
## 4 0.7699274 0.06363013 0.07919901
## 5 0.7758214 0.08198912 0.10158477
## 6 0.7920018 0.06402429 0.07105956
Finally, the preprocessing is over and we can fit the GLM.
glm.full <- glm(Winner ~ . - state, data = state.census, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.full)
##
## Call:
## glm(formula = Winner ~ . - state, family = binomial, data = state.census)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.309e-05 -2.110e-08 2.110e-08 9.782e-07 1.212e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.171e+03 4.792e+07 0 1
## TotalPop 6.582e-06 2.381e-02 0 1
## Men -1.529e+03 6.551e+07 0 1
## White -6.165e+02 6.301e+06 0 1
## Minority -8.562e+02 6.472e+06 0 1
## Citizen 6.005e+02 4.297e+06 0 1
## Income 1.230e-03 1.258e+02 0 1
## IncomeErr 1.890e-03 2.334e+02 0 1
## Poverty 1.797e+03 2.477e+07 0 1
## ChildPoverty -2.677e+02 1.861e+07 0 1
## Professional -1.050e+03 2.721e+07 0 1
## Service -1.569e+03 1.643e+07 0 1
## Office 2.076e+03 1.693e+07 0 1
## Production -5.594e+02 1.611e+07 0 1
## Drive -1.503e+02 3.819e+06 0 1
## Carpool 3.182e+02 1.504e+07 0 1
## WorkAtHome 1.289e+03 2.241e+07 0 1
## MeanCommute -1.193e+01 1.311e+05 0 1
## PrivateWork -8.060e+02 9.503e+06 0 1
## SelfEmployed -4.331e+03 2.264e+07 0 1
## Unemployment -1.017e+03 1.349e+07 0 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.9104e+01 on 50 degrees of freedom
## Residual deviance: 1.4293e-09 on 30 degrees of freedom
## AIC: 42
##
## Number of Fisher Scoring iterations: 25
When all of the predictors provided in the dataset are used, none of them appear significant. This is misleading: with only 51 observations and 20 predictors, the model separates the data perfectly (hence the warnings about non-convergence and fitted probabilities of 0 or 1, and the residual deviance of essentially zero), so the coefficients blow up and every standard error is enormous. It is also likely that many of the columns are related; for example, White and Minority are near mirror images of each other, and the job-type variables are probably captured better by income. So I will drop many of the variables that seem unimportant to clean up the model.
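The near-collinearity described above can be seen directly. As a synthetic illustration (not the actual census columns), two predictors that are near mirror images of each other, like White and Minority, have a correlation close to -1, which is what destabilizes the coefficient estimates:

```r
# Synthetic illustration of the White/Minority collinearity problem.
set.seed(1)
white <- runif(50, min = 30, max = 90)       # percent white
minority <- 100 - white + rnorm(50, sd = 2)  # nearly complementary
cor(white, minority)                         # very close to -1
```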
library(arm)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
## Loading required package: lme4
##
## arm (Version 1.10-1, built: 2018-4-12)
## Working directory is /Users/joshuakurtz/Documents/PSTAT 131 Project/2016-election
glm1 <- glm(Winner ~Men + Income + SelfEmployed + Citizen + Unemployment,data = state.census, family = binomial)
str(Winner)
## Factor w/ 2 levels "Clinton","Trump": 2 2 2 2 1 1 1 1 1 2 ...
summary(glm1)
##
## Call:
## glm(formula = Winner ~ Men + Income + SelfEmployed + Citizen +
## Unemployment, family = binomial, data = state.census)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5220 -0.1607 0.1479 0.3549 1.3790
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.615e+01 4.246e+01 -1.793 0.072910 .
## Men 2.236e+02 8.988e+01 2.488 0.012843 *
## Income -4.004e-04 1.195e-04 -3.350 0.000809 ***
## SelfEmployed -1.425e+02 5.850e+01 -2.436 0.014846 *
## Citizen 4.999e+00 1.513e+01 0.330 0.741126
## Unemployment -6.887e+01 3.998e+01 -1.723 0.084917 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 69.104 on 50 degrees of freedom
## Residual deviance: 28.286 on 45 degrees of freedom
## AIC: 40.286
##
## Number of Fisher Scoring iterations: 7
In this model, it becomes apparent that the significant predictors are income, the percentage of men, and the self-employment rate. Since these are the strongest predictors, we will refit the model with just them.
library(arm)
glm2 <- glm(Winner ~Men + Income + SelfEmployed,data = state.census, family = binomial)
summary(glm2)
##
## Call:
## glm(formula = Winner ~ Men + Income + SelfEmployed, family = binomial,
## data = state.census)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8742 -0.1781 0.2595 0.5632 1.2146
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.942e+01 3.328e+01 -2.386 0.017020 *
## Men 2.122e+02 7.808e+01 2.717 0.006588 **
## Income -3.330e-04 9.998e-05 -3.331 0.000866 ***
## SelfEmployed -9.082e+01 4.388e+01 -2.070 0.038464 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 69.104 on 50 degrees of freedom
## Residual deviance: 32.675 on 47 degrees of freedom
## AIC: 40.675
##
## Number of Fisher Scoring iterations: 6
In this model all predictors are significant, and there is important information to be gained from it.
A one-unit increase in the proportion of men increases the log-odds of a state voting for Trump by 2.122e+02. This implies men are more likely to vote for Trump and women are more likely to vote for Hillary.
A one-unit increase in income decreases the log-odds of voting for Trump by 3.330e-04. This implies wealthier people are more likely to vote for Hillary than Trump, which is surprising because Trump gives many tax breaks to the elite. It could be that Hillary wins states that are wealthier by nature while Trump wins poorer states.
A one-unit increase in the proportion of self-employed people decreases the log-odds of voting for Trump by 9.082e+01. This implies self-employed people are more likely to vote for Hillary than Trump.
Finally, the intercept (all covariates set to 0) is not meaningful, because no state has zero income, no self-employed people, and no men.
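To make these coefficients more interpretable, note that a one-unit increase in a predictor changes the log-odds by the coefficient β, which multiplies the odds by exp(β). A quick sketch using the Income coefficient reported above:

```r
# Convert the Income coefficient to an odds ratio.
beta_income <- -3.330e-04  # from summary(glm2) above
exp(beta_income)           # odds multiplier per $1 of mean income
exp(beta_income * 10000)   # per $10,000: about 0.036, i.e. the odds
                           # of a Trump win drop sharply with income
```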
# Specify type="response" to get the estimated probabilities
prob.training = predict(glm2, type="response")
# Round the results with round(); for instance, round prob.training to 2 decimal places
round(prob.training, digits=2)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 0.95 0.71 0.90 0.99 0.01 0.24 0.00 0.16 0.00 0.82 0.73 0.04 0.97 0.35 0.96
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 0.79 0.82 0.99 0.96 0.31 0.00 0.00 0.88 0.23 0.99 0.86 0.92 0.68 0.98 0.01
## 31 32 33 34 35 36 37 38 39 41 42 43 44 45 46
## 0.00 0.98 0.02 0.82 0.77 0.85 0.96 0.53 0.48 0.25 0.93 0.87 0.75 0.51 0.79
## 47 48 49 50 51 52
## 0.08 0.01 0.30 1.00 0.92 0.94
#Save the predicted labels using 0.5 as a threshold
state.census = state.census %>%
mutate(predWinner=as.factor(ifelse(prob.training<=0.5, "Clinton", "Trump")))
# Confusion matrix (training error/accuracy)
table(pred=state.census$predWinner, true=state.census$Winner)
## true
## pred Clinton Trump
## Clinton 18 1
## Trump 3 29
Using this model to predict the actual outcome in each state, we see it is very accurate. This is a testament to the importance of income, sex, and self-employment status in the election.
The numbers above correspond to the order of the states in state.census (index 40 is missing because Puerto Rico was removed). For example, 1 is Alabama, where Trump has a 95 percent chance to win, and 5 is California, where Trump has a 1 percent chance to win according to the model.
According to the confusion matrix, out of the 51 predicted regions (the 50 states plus DC), the model classifies 18 + 29 = 47 correctly, about 92 percent; the remaining 4 were predicted incorrectly.
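The accuracy figure can be checked directly from the confusion matrix counts above (correct predictions sit on the diagonal; 51 rows because the data cover the 50 states plus the District of Columbia):

```r
# Training accuracy from the confusion matrix counts above.
conf <- matrix(c(18, 1, 3, 29), nrow = 2)  # 18 and 29 are the correct calls
sum(diag(conf)) / sum(conf)  # 47 / 51, about 0.922
```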
Overall, we can have confidence in our model because the predictors that matter are plausible, and the direction of each predictor's effect on a candidate's chances is consistent with the candidates' platforms and their reception among Americans today. This suggests that logistic regression has a powerful place in political prediction.